the proportion of the reads that contain the adaptor sequences at each position. Known
adaptor sequences and their descriptions are stored in the “adapter_list.txt” file, as shown in
Figure 1.25.
K-mers (sequences of k bases) are formed from the adaptor sequences in the
“adapter_list.txt” file. The program then searches the reads for these k-mers and reports the
total percentage of the reads that contain them. The report may reveal sources of
bias caused by contaminating adaptor dimers in the library.
A warning is raised if any sequence is present in more than 5% of all reads, and a failure
occurs if any sequence is present in more than 10% of all reads.
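The k-mer search and the two thresholds above can be sketched in a few lines of Python. This is a minimal illustration, not the program's actual implementation. The function names, the default k-mer size, the toy adaptor and reads, and the choice to count a read as adaptor-containing at every position after the first match (a cumulative count) are all assumptions made for this example.

```python
from typing import List, Set

# Thresholds from the text, as percentages of all reads.
WARN_PERCENT = 5.0
FAIL_PERCENT = 10.0


def adaptor_kmers(adaptor: str, k: int) -> Set[str]:
    """Return every k-mer (substring of k bases) of an adaptor sequence."""
    return {adaptor[i:i + k] for i in range(len(adaptor) - k + 1)}


def adaptor_content(reads: List[str], adaptor: str, k: int = 12) -> List[float]:
    """Percentage of reads in which an adaptor k-mer has been seen, per position.

    Assumption of this sketch: once a k-mer is found in a read, the read is
    counted as adaptor-containing at that position and at every later one.
    """
    if not reads:
        return []
    kmers = adaptor_kmers(adaptor, k)
    n_positions = max(0, max(len(r) for r in reads) - k + 1)
    counts = [0] * n_positions
    for read in reads:
        for pos in range(len(read) - k + 1):
            if read[pos:pos + k] in kmers:
                for later in range(pos, n_positions):
                    counts[later] += 1
                break  # only the first match in a read matters
    return [100.0 * c / len(reads) for c in counts]


def qc_status(percentages: List[float]) -> str:
    """Apply the warning (>5%) and failure (>10%) thresholds to the peak value."""
    peak = max(percentages, default=0.0)
    if peak > FAIL_PERCENT:
        return "fail"
    if peak > WARN_PERCENT:
        return "warn"
    return "pass"


if __name__ == "__main__":
    # Toy data: a made-up adaptor and four 8-base reads, one of which carries it.
    toy_reads = ["CCGATTAC", "CCCCCCCC", "CCCCCCCC", "CCCCCCCC"]
    content = adaptor_content(toy_reads, "GATTACA", k=4)
    print(content)             # [0.0, 0.0, 25.0, 25.0, 25.0]
    print(qc_status(content))  # fail
```

In the toy data, one read in four carries an adaptor k-mer starting at the third position, so the curve rises to 25% there and the metric fails.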
Figure 1.26 shows the adaptor content graph for a FASTQ file whose raw reads contain no
adaptor sequences (left) and for a FASTQ file that fails this metric because of a significant
content of the Illumina Universal Adaptor (right).
1.5.12 K-mer Content
The K-mer content graph plots the count of each short nucleotide sequence of length k
(default k = 7) against the position in the reads. In normal k-mer content, k-mers are expected
to be represented evenly across the length of the reads. The graph shows the positional
profiles of only the six most significant k-mers (Figure 1.27). The k-mers that are present
at specific positions with significant abundance are reported in a table that includes the k-mer
sequences, counts, p-values, and the position at which the observed-to-expected ratio is
highest. Caution is required when the reads are from RNA-Seq libraries; the significant k-mers
may be due to highly expressed genes, and hence they may be of biological rather than
technical origin.
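The idea of counting k-mers by position and comparing each position with an even distribution can be sketched as follows. This is a simplified illustration under stated assumptions: all reads are taken to have the same length, the significance test (p-value) is omitted, and the function names and the `min_count` filter are hypothetical helpers introduced only for this example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def kmer_position_counts(reads: List[str], k: int = 7) -> Dict[str, List[int]]:
    """Count how often each k-mer starts at each position in the reads."""
    n_positions = max(0, max((len(r) for r in reads), default=0) - k + 1)
    counts: Dict[str, List[int]] = defaultdict(lambda: [0] * n_positions)
    for read in reads:
        for pos in range(len(read) - k + 1):
            counts[read[pos:pos + k]][pos] += 1
    return dict(counts)


def most_biased_kmers(
    counts: Dict[str, List[int]], top: int = 6, min_count: int = 1
) -> List[Tuple[str, int, float, int]]:
    """Rank k-mers by their highest observed/expected ratio over all positions.

    The expected count at a position assumes the k-mer is spread evenly along
    the reads. Each row is (k-mer, total count, max obs/exp, 1-based position).
    """
    rows = []
    for kmer, per_pos in counts.items():
        total = sum(per_pos)
        if total < min_count:
            continue  # skip rare k-mers, whose ratios are not informative
        expected = total / len(per_pos)
        ratios = [observed / expected for observed in per_pos]
        best = max(range(len(ratios)), key=ratios.__getitem__)
        rows.append((kmer, total, ratios[best], best + 1))
    # Highest ratio first; ties broken by total count, then alphabetically.
    rows.sort(key=lambda row: (-row[2], -row[1], row[0]))
    return rows[:top]


if __name__ == "__main__":
    # Toy data with k = 3: "ACG" starts three of the four reads.
    toy_reads = ["ACGTTT", "ACGGGG", "ACGCCC", "TTTACG"]
    toy_counts = kmer_position_counts(toy_reads, k=3)
    print(toy_counts["ACG"])                            # [3, 0, 0, 1]
    print(most_biased_kmers(toy_counts, min_count=3))   # [('ACG', 4, 3.0, 1)]
```

In the toy data, "ACG" occurs four times but three of those occurrences start at position 1, three times what an even spread over four positions would give, which is the kind of positional bias the graph is meant to expose.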
FIGURE 1.25 Some known adaptor sequences.
FIGURE 1.26 Adaptor content graphs.